Rollout Allocation Strategies for Classification-based Policy Iteration
Authors
Abstract
Classification-based policy iteration algorithms are variations of policy iteration that do not use any kind of value function representation. The main idea is 1) to replace the usual value function learning step with rollout estimates of the value function over a finite number of states, called the rollout set, and the actions in the action space, and 2) to cast the policy improvement step as a classification problem. The choice of rollout allocation strategy over states and actions has a significant impact on the performance and computation time of this class of algorithms. In this paper, we present new strategies to allocate the available budget (number of rollouts) at each iteration of the algorithm over states and actions. Our empirical results indicate that, for a fixed budget, the proposed strategies improve the accuracy of the training set over existing methods.
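The two-step scheme described in the abstract (rollout estimates of action values over a rollout set, then policy improvement cast as classification) can be sketched as follows. The toy chain MDP, the action set, and the tabular state-to-action lookup standing in for a classifier are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch of one iteration of rollout classification-based
# policy iteration on a toy chain MDP (states 0..4, actions -1/+1,
# reward 1 on reaching state 4). The MDP and the tabular "classifier"
# are assumptions for this sketch, not the paper's setup.

N = 4  # terminal/reward state of the chain

def step(s, a):
    """Deterministic transition; reward 1 when the walk reaches state N."""
    s2 = max(0, min(N, s + a))
    return s2, (1.0 if s2 == N else 0.0)

def rollout_return(s, a, policy, horizon=10, gamma=0.9):
    """Rollout estimate of Q(s, a): take action a, then follow policy."""
    s, r = step(s, a)
    total, disc = r, gamma
    for _ in range(horizon):
        s, r = step(s, policy(s))
        total += disc * r
        disc *= gamma
    return total

def build_training_set(rollout_set, policy, n_rollouts=4):
    """Label each rollout state with the greedy action under rollout estimates.
    n_rollouts averages repeated rollouts (redundant in this noiseless toy)."""
    labels = {}
    for s in rollout_set:
        q = {a: sum(rollout_return(s, a, policy) for _ in range(n_rollouts)) / n_rollouts
             for a in (-1, +1)}
        labels[s] = max(q, key=q.get)  # greedy action = classification label
    return labels

base_policy = lambda s: -1  # weak base policy: always move left
training_set = build_training_set(range(N), base_policy)
# State 3 gets label +1: a single step right reaches the reward state.
```

A classifier fit on this training set would then serve as the improved policy for the next iteration; the allocation question studied in the paper is how to split a fixed rollout budget across the states and actions that generate these labels.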
Similar resources
Classification-based Policy Iteration with a Critic
In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a ...
Algorithms and Bounds for Rollout Sampling Approximate Policy Iteration
Several approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem, have been proposed recently. Finding good policies with such methods requires not only an appropriate classifier, but also reliable examples of best actions, covering the state space sufficiently. Up to this ti...
Rollout Sampling Policy Iteration for Decentralized POMDPs
We present decentralized rollout sampling policy iteration (DecRSPI), a new algorithm for multiagent decision problems formalized as DEC-POMDPs. DecRSPI is designed to improve scalability and tackle problems that lack an explicit model. The algorithm uses Monte-Carlo methods to generate a sample of reachable belief states. Then it computes a joint policy for each belief state based on the rollout...
Differential Training of Rollout Policies
We consider the approximate solution of stochastic optimal control problems using a neurodynamic programming/reinforcement learning methodology. We focus on the computation of a rollout policy, which is obtained by a single policy iteration starting from some known base policy and using some form of exact or approximate policy improvement. We indicate that, in a stochastic environment, the popu...
Average-Case Performance of Rollout Algorithms for Knapsack Problems
Rollout algorithms have demonstrated excellent performance on a variety of dynamic and discrete optimization problems. Interpreted as an approximate dynamic programming algorithm, a rollout algorithm estimates the value-to-go at each decision stage by simulating future events while following a heuristic policy, referred to as the base policy. While in many cases rollout algorithms are guarantee...
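The value-to-go estimation described above can be illustrated on a 0/1 knapsack instance: at each decision stage the rollout algorithm scores "take" and "skip" by completing the remaining stages with a greedy base heuristic. The instance and the greedy base policy below are assumptions for the sketch, not the paper's exact formulation.

```python
# Illustrative rollout algorithm for 0/1 knapsack. At each stage the
# value-to-go of each action is estimated by simulating the remaining
# decisions under a greedy base policy. Instance data is hypothetical.

def greedy_complete(items, k, cap):
    """Base policy: from stage k, greedily take each remaining item that fits."""
    total = 0
    for v, w in items[k:]:
        if w <= cap:
            total += v
            cap -= w
    return total

def rollout_knapsack(items, cap):
    """One pass over the stages, picking the action with the better
    (immediate value + base-policy completion) estimate."""
    value = 0
    for k, (v, w) in enumerate(items):
        skip = greedy_complete(items, k + 1, cap)
        # Sentinel -1 rules out "take" when the item does not fit.
        take = (v + greedy_complete(items, k + 1, cap - w)) if w <= cap else -1
        if take >= skip:
            value += v
            cap -= w
    return value

items = [(6, 5), (5, 3), (5, 3)]  # (value, weight) pairs, hypothetical
# Greedy base policy alone collects 6; the rollout policy collects 10,
# consistent with rollout being no worse than its base policy here.
base_value = greedy_complete(items, 0, 6)
rollout_value = rollout_knapsack(items, 6)
```

The improvement comes from the lookahead: the rollout sees that taking the heavy first item blocks the two lighter, jointly more valuable items that the myopic base policy would otherwise miss.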